Here, we read the gapminder_clean.csv file using the tidyverse function read_csv().
We filter the gapminder dataset to include only data from the year 1962, then plot log-transformed CO2 emissions (metric tons per capita) against log-transformed gdpPercap. The log transformation reduces skewness.
gapminder_filtered <- filter(gapminder, Year == 1962)
gapminder_plot <- ggplot(gapminder_filtered, aes(x = log(gdpPercap),
y = log(`CO2 emissions (metric tons per capita)`))) +
# log() is the natural logarithm, so the axis labels do not say "Log 10"
labs(x = 'Log gdpPercap', y = 'Log CO2 emissions (metric tons per capita)') +
geom_point()+
theme(text = element_text(size = 10))
gapminder_plot
Here, we use the cor() function to calculate the Pearson correlation coefficient between CO2 emissions (metric tons per capita) and gdpPercap for the year 1962. Then, we calculate the p-value of the correlation coefficient with cor.test().
correlation <- cor(gapminder_filtered$`CO2 emissions (metric tons per capita)`,
gapminder_filtered$gdpPercap, use = "complete.obs")
p_value <- cor.test(gapminder_filtered$`CO2 emissions (metric tons per capita)`,
gapminder_filtered$gdpPercap)$p.value
print(paste("The correlation coefficient is", as.character(correlation)))
print(paste("The p-value is", as.character(p_value)))
## [1] "The correlation coefficient is 0.926081672501945"
## [1] "The p-value is 1.1286792210055e-46"
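Both numbers can also be read from a single cor.test() result, which avoids passing the two columns twice. The following is a minimal sketch on made-up values; the vectors `co2` and `gdp` are illustrative stand-ins, not gapminder data.

```r
# Toy vectors standing in for the two gapminder columns (made-up values)
co2 <- c(0.1, 0.5, 1.2, NA, 3.4, 7.9)
gdp <- c(400, 900, 2100, 3000, 5200, 11000)

# cor.test() drops incomplete pairs itself and returns the estimate
# and the p-value in one object
ct <- cor.test(co2, gdp, method = "pearson")
correlation <- unname(ct$estimate)
p_value <- ct$p.value

print(paste("The correlation coefficient is", as.character(correlation)))
print(paste("The p-value is", as.character(p_value)))
```

Because cor.test() only uses complete pairs, its estimate matches cor() called with use = "complete.obs".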
First, we remove rows with missing values for CO2 emissions (metric tons per capita) or gdpPercap. Next, we calculate the correlation for each year and pick the year with the largest coefficient, which is the year with the strongest correlation.
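The code for this step is not shown in the report. One way it could look is sketched below on a small made-up table; `toy`, `cor_by_year` and `strongest_year` are hypothetical names, and in the real analysis `toy` would be replaced by the gapminder data frame.

```r
library(dplyr)

# Toy data with the column names used in this report (values are made up)
toy <- tibble(
  Year = rep(c(1962, 1967), each = 4),
  gdpPercap = c(1, 2, 3, 4, 1, 2, 3, 4),
  `CO2 emissions (metric tons per capita)` = c(1, 2, 3, NA, 1, 3, 2, 4)
)

cor_by_year <- toy %>%
  # Keep only rows where both variables are present
  filter(!is.na(gdpPercap),
         !is.na(`CO2 emissions (metric tons per capita)`)) %>%
  group_by(Year) %>%
  # One Pearson correlation per year
  summarize(correlation = cor(`CO2 emissions (metric tons per capita)`,
                              gdpPercap)) %>%
  # Largest correlation first
  arrange(desc(correlation))

strongest_year <- cor_by_year$Year[1]
print(cor_by_year)
```

Sorting by the raw coefficient assumes the relationship is positive in every year; sorting by abs(correlation) would be the choice if negative correlations also counted as strong.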
For that year, we plot the log-transformed CO2 emissions (metric tons per capita) against the log-transformed gdpPercap, with point size determined by population and color by continent.
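A minimal sketch of such a plot follows. The data are made up, and the population column name `pop` is an assumption about the dataset; `bubble_plot` is a hypothetical name.

```r
library(ggplot2)

# Toy data (made-up values); `pop` is an assumed column name
toy <- data.frame(
  gdpPercap = c(500, 2000, 8000, 30000),
  `CO2 emissions (metric tons per capita)` = c(0.1, 0.8, 4.0, 12.0),
  pop = c(5e6, 4e7, 1e8, 6e7),
  continent = c("Africa", "Asia", "Americas", "Europe"),
  check.names = FALSE  # keep the column name with spaces and parentheses
)

bubble_plot <- ggplot(toy, aes(x = log(gdpPercap),
                               y = log(`CO2 emissions (metric tons per capita)`),
                               size = pop, color = continent)) +
  geom_point(alpha = 0.7) +
  labs(x = "Log gdpPercap",
       y = "Log CO2 emissions (metric tons per capita)",
       size = "Population", color = "Continent")
bubble_plot
```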
We want to test if there is a statistically significant difference in the Energy use per continent. Thus, continent is the predictor variable, and energy use is the outcome variable.
To use a parametric test, three assumptions must be met: normality, equal variances, and independence.
We can visually assess normality using histograms of energy use for each continent:
ggplot(gapminder, aes(x = `Energy use (kg of oil equivalent per capita)`)) +
geom_histogram(bins = 30) +
facet_wrap(~ continent, scales = "free") +
xlab("Energy use (kg of oil equivalent per capita)") +
ylab("Frequency")
As the histograms show, energy use is not normally distributed within each continent. Therefore, we use the non-parametric Kruskal-Wallis test, which does not assume normality, to test whether the distribution of energy use differs between continents.
##
## Kruskal-Wallis rank sum test
##
## data: gapminder$`Energy use (kg of oil equivalent per capita)` and gapminder$continent
## Kruskal-Wallis chi-squared = 318.68, df = 4, p-value < 2.2e-16
As we can see, the p-value is below 2.2e-16, far below the 0.05 threshold. Therefore, energy use differs significantly between at least two continents.
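The call that produced the output above is not shown. Its "data:" line is consistent with kruskal.test() being given the energy-use column and the continent column as two separate arguments. The sketch below uses that form on made-up values, so its numbers differ from the real output; `toy` and `kw` are hypothetical names.

```r
# Toy data (made-up values) with the column names used in this report
toy <- data.frame(
  continent = rep(c("Africa", "Asia", "Europe"), each = 5),
  `Energy use (kg of oil equivalent per capita)` =
    c(300, 350, 400, 420, 500,
      600, 800, 900, 1200, 1500,
      2500, 3000, 3200, 4000, 4500),
  check.names = FALSE  # keep the column name with spaces and parentheses
)

# First argument: the outcome; second argument: the grouping variable
kw <- kruskal.test(toy$`Energy use (kg of oil equivalent per capita)`,
                   toy$continent)
print(kw)
```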
Here, we keep only the observations after 1990 and create a box plot and a density plot to compare imports of goods and services (% of GDP) for Europe and Asia. This lets us visually assess the difference between the two continents.
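The filtering step that creates gapminder_years is not shown in the report. A minimal sketch, assuming "past 1990" means Year > 1990 and that only Europe and Asia are kept, could look like this. The table `toy` is made up, and in the real analysis it would be replaced by the gapminder data frame.

```r
library(dplyr)

# Toy data (made-up values) with the column names used in this report
toy <- tibble(
  Year = c(1987, 1992, 1997, 2002, 2007, 2007),
  continent = c("Asia", "Asia", "Europe", "Africa", "Europe", NA),
  `Imports of goods and services (% of GDP)` = c(20, 25, 30, 35, 40, 45)
)

gapminder_years <- toy %>%
  # Years strictly after 1990, restricted to the two continents compared
  filter(Year > 1990, continent %in% c("Europe", "Asia"))

print(gapminder_years)
```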
# Create box plots
box_plot <- ggplot(gapminder_years, aes(x = continent, y = `Imports of goods and services (% of GDP)`, fill = continent)) +
geom_boxplot() +
labs(x = "Continent", y = "Imports of goods and services (% of GDP)", fill = "Continent") +
ggtitle("Box Plot of GDP Imports by Continent")
ggplotly(box_plot)
# Create density plots
density_plot <- ggplot(gapminder_years, aes(x = `Imports of goods and services (% of GDP)`, fill = continent)) +
geom_density(alpha = 0.5) +
labs(x = "Imports of goods and services (% of GDP)", fill = "Continent") +
ggtitle("Density Plot of GDP Imports by Continent")
ggplotly(density_plot)
Visually, the two continents' imports of goods and services are very close, with overlapping peaks, although the spread appears to differ between the two continents.
We draw a Q-Q plot of imports of goods and services for Asia and Europe to assess normality. If the points fall along a roughly straight line, the data are consistent with a normal distribution.
For Asia, a few points with high import values lie above the diagonal line on the right. Those points do not fall on a straight line, so normality appears to be violated and a parametric test would not be appropriate. We therefore use the non-parametric Mann-Whitney-Wilcoxon test, which does not assume normality.
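The Q-Q plot code is not shown in the report. One way to draw such a plot with ggplot2 is sketched below on simulated values; the skewed "Asia" sample and the roughly normal "Europe" sample are made up purely to show the two shapes, and `toy` and `qq_plot` are hypothetical names.

```r
library(ggplot2)

set.seed(1)
# Simulated values: a right-skewed sample and a roughly normal sample
toy <- data.frame(
  continent = rep(c("Asia", "Europe"), each = 50),
  `Imports of goods and services (% of GDP)` =
    c(rexp(50, rate = 1 / 30), rnorm(50, mean = 40, sd = 10)),
  check.names = FALSE  # keep the column name with spaces and parentheses
)

qq_plot <- ggplot(toy, aes(sample = `Imports of goods and services (% of GDP)`)) +
  stat_qq() +        # sample quantiles against theoretical normal quantiles
  stat_qq_line() +   # reference line through the first and third quartiles
  facet_wrap(~ continent, scales = "free") +
  labs(x = "Theoretical quantiles", y = "Sample quantiles")
qq_plot
```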
We perform the Mann-Whitney-Wilcoxon test to check whether the distribution of imports of goods and services is shifted between Europe and Asia.
result <- wilcox.test(`Imports of goods and services (% of GDP)` ~ continent, data = gapminder_years)
print(result)
##
## Wilcoxon rank sum test with continuity correction
##
## data: Imports of goods and services (% of GDP) by continent
## W = 5707, p-value = 0.7867
## alternative hypothesis: true location shift is not equal to 0
As the p value of 0.7867 is greater than 0.05, we did not find a significant difference in ‘Imports of goods and services (% of GDP)’ between Europe and Asia.
'Population density (people per sq. km of land area)'
across all years? (i.e., which country has the highest average ranking
in this category across each time point in the dataset?)
We draw a bar plot showing the average rank of the five countries with the highest average ranking.
# Rank countries by population density within each year, descending (the densest country gets rank 1)
gapminder_pd <- gapminder %>%
group_by(`Year`) %>%
mutate(population_rank = rank(-`Population density (people per sq. km of land area)`)) %>%
dplyr::select(population_rank, `Country Name`)
# Taking the average population rank
gapminder_rank_mean <- gapminder_pd %>%
group_by(`Country Name`) %>%
summarize(mean_population_density_rank = mean(population_rank)) %>%
arrange(mean_population_density_rank) %>%
slice(1:5)
gapminder_rank_mean_plot <- ggplot(data = gapminder_rank_mean, aes(x = `Country Name`, y = mean_population_density_rank)) +
geom_bar(stat = 'identity') +
labs(title = "Countries with greatest average ranking for population density", y = 'Mean rank of country')+
coord_flip()
gapminder_rank_mean_plot
As seen from the bar chart, Macao SAR, China and Monaco are tied for having the highest 'Population density (people per sq. km of land area)' across all years; each had an average rank of 1.5.
'Life expectancy at birth, total (years)' between 1962 and 2007?
We can plot a bar chart of the five countries with the greatest increase in life expectancy at birth between 1962 and 2007.
# Get the top 5 countries with the greatest increase in life expectancies
gapminder_difference <- gapminder %>%
filter(Year %in% c(1962, 2007)) %>%
group_by(`Country Name`) %>%
# Sort by year so that diff() returns the 2007 value minus the 1962 value
arrange(Year) %>%
reframe(`Difference in Life expectancy (2007 - 1962)` = diff(`Life expectancy at birth, total (years)`)) %>%
arrange(desc(`Difference in Life expectancy (2007 - 1962)`)) %>%
slice(1:5)
# Plotting the top 5 countries
gapminder_difference_plot <- gapminder_difference %>%
ggplot(aes(x = `Country Name`, y = `Difference in Life expectancy (2007 - 1962)`)) +
geom_bar(stat = "identity") +
labs(title = "Top 5 countries with the greatest increase in life expectancy from 1962 to 2007", y = 'Difference in Life expectancy in years for (2007 - 1962)')+
coord_flip()
ggplotly(gapminder_difference_plot)
As seen from the bar chart above, the country whose life expectancy increased the most from 1962 to 2007 is the Maldives.